DATA SCIENCE SESSIONS VOL. 3

A Foundational Python Data Science Course

Session 15: Regularization in MLR. The Maximum Likelihood Estimation (MLE).

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


Let's first import all of the necessary libraries for later.

0. The Problem of Overfitting

Overfitting is a common problem in machine learning where a model learns to fit the training data too closely, leading to poor performance on new or unseen data. In other words, the model becomes too complex and starts to memorize the training data instead of learning the underlying patterns and relationships.

Overfitting occurs when a model is trained on a limited set of data, and the model is too complex, with too many parameters or features, to generalize to new data. This can lead to a model that performs very well on the training data but poorly on new data.

Overfitting can be identified by comparing the model's performance on the training data to its performance on a validation set or test data. If the model performs significantly better on the training data than on the validation set, then it is likely overfitting.

To prevent overfitting, one can use techniques such as regularization, early stopping, or reducing the complexity of the model. Regularization techniques, such as L1 or L2 regularization, can penalize the model for having too many features or high parameter values, while early stopping can stop the training process before the model starts to overfit. Additionally, reducing the complexity of the model, such as by decreasing the number of features or using a simpler model architecture, can also help prevent overfitting.
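The train-versus-test comparison described above can be sketched with scikit-learn. The sine-shaped data, the polynomial degrees, and all numbers below are illustrative choices, not part of the course material:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 40).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

scores = {}
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # A large gap between train and test R2 signals overfitting
    scores[degree] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(degree, scores[degree])
```

The degree-15 polynomial fits the training points almost perfectly while generalizing poorly, which is exactly the train/test gap that flags overfitting.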

Let's, for the sake of the example, generate a sample of bivariate distribution.
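A minimal sketch of such a step, assuming a bivariate normal "population" with an illustrative correlation of .8 (the actual parameters used in the session's code cell may differ):

```python
import numpy as np

rng = np.random.default_rng(42)

# Bivariate normal "population"; the means, variances, and the .8
# correlation are illustrative choices.
mean = [0.0, 0.0]
cov = [[1.0, 0.8],
       [0.8, 1.0]]

sample = rng.multivariate_normal(mean, cov, size=100)
x, y = sample[:, 0], sample[:, 1]
print(np.corrcoef(x, y)[0, 1])  # sample correlation, close to .8
```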

Now, shall we plot values of y as a function of x?

This is all fine. However, how about we declare this bivariate distribution as a population, draw 10,000 samples of 100 observations each, and do the same thing, i.e. get the intercept and slope parameters of a linear model for each sample?

We will store the parameter values, together with the value of R2, in a dedicated Pandas DataFrame for easier display.
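One possible implementation of this simulation, again assuming the illustrative bivariate normal population; `np.polyfit` stands in for whatever fitting routine the notebook's own cell uses:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
mean = [0.0, 0.0]
cov = [[1.0, 0.8], [0.8, 1.0]]  # illustrative population parameters

records = []
for _ in range(10_000):
    s = rng.multivariate_normal(mean, cov, size=100)
    x, y = s[:, 0], s[:, 1]
    slope, intercept = np.polyfit(x, y, deg=1)  # simple linear fit
    r2 = np.corrcoef(x, y)[0, 1] ** 2           # R2 for simple regression
    records.append({"intercept": intercept, "slope": slope, "r2": r2})

params = pd.DataFrame(records)
print(params.describe())
```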

What is the distribution of the R2 metric across all of the samples?

And what did each of the linear regression models look like? Here are the fitted lines for the first 12 samples, together with the observations.

Note how the intercepts and slopes differ from sample to sample, and how R2 varies as well. Why? We have sampled from the same 'population'.

1. Regularization in Multiple Linear Regression

1.1 Lasso (L1), Ridge (L2), and Elastic Net regression

Regularization serves several purposes: it prevents overfitting, it mitigates the effects of multicollinearity, and, in some of its variants, it performs feature selection.

Consider the following:

$$Loss = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}$$

It is just the ordinary sum of squares, the loss (or cost) function of the Linear Regression Model. The parameters $\beta$ of the Linear Model are estimated by minimizing the above quantity, the model's cost function. The expression above is just the ordinary $SSE$ function as we have encountered it before in our sessions on Linear Regression.

Now, consider the following formulation of the cost function that includes the penalty term:

$$Loss_{L2} = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}+\lambda\sum_{j=1}^{p}\beta_j^2$$

$\lambda\sum_{j=1}^{p}\beta_j^2$ is the penalty term: it increases the value of the cost function by a factor determined by the sum of the squared model coefficients. In the expression above $p$ stands for the number of model parameters (coefficients, in the case of Linear Regression) and $\beta$ are the coefficients themselves. The above expression represents the $L2$ regularization for the Linear Model, also known as Ridge Regression. If $\lambda$ is zero we get back to the Ordinary Least Squares model that we have already learned about. If $\lambda$ is too large, it will add too much weight to the cost function which will lead to underfitting.

The underlying idea is to penalize large coefficients: it regularizes the coefficients so that if they take large values the cost function is penalized. In effect, Ridge regression shrinks the coefficients so as to reduce the model's complexity, and it reduces the effects of multicollinearity.
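A small sketch of this shrinking effect, on synthetic data with two deliberately collinear predictors (all numbers below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Two strongly correlated predictors, to mimic multicollinearity
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])
y = 3 * x1 + 3 * x2 + rng.normal(size=200)

norms = {}
for alpha in [0.01, 1, 10, 100]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coefs)  # magnitude of the coefficient vector
    print(alpha, np.round(coefs, 3))
```

As `alpha` grows, the coefficient vector shrinks toward zero, and the estimates for the two collinear predictors become more similar to each other.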

Consider now the following modification of the cost function:

$$Loss_{L1} = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}+\lambda\sum_{j=1}^{p}|\beta_j|$$

The above cost function is based on a weighted sum of the absolute values of the model coefficients. This approach is known as $L1$ or Lasso Regression.

The most important difference between Ridge and Lasso regularization is that in Lasso some coefficients can be shrunk to zero and completely eliminated from the model. Thus, Lasso regularization not only prevents overfitting but also performs feature selection, a very handy thing when considering large linear models.
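A minimal illustration of this selection effect, on synthetic data where only the first two of five predictors truly matter (the data and the `alpha` value are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two of five predictors truly matter
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 3))  # the irrelevant coefficients end up exactly 0
```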

Finally, consider the following formulation of a loss (or cost) function:

$$Loss_{ElasticNet} = \frac{\sum_{i=1}^{n}(y_i-\hat{y_i})^2}{2n}+\lambda\left(\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2+\alpha\sum_{j=1}^{p}|\beta_j|\right)$$

In the expression above - the Elastic Net regression - $\alpha$ is the mixing parameter between Lasso ($\alpha=1$) and Ridge ($\alpha=0$) approach. While the Ridge regression shrinks the coefficients of correlated predictors towards each other, the Lasso regression tends to pick one of them and decrease the others. The Elastic Net penalty is a mixture of the two approaches controlled by $\alpha$.

1.2 Inspect Multicollinearity in one particular Linear Regression problem

Here we are going to use the King County (Washington State, USA) housing data set. It is well suited for predicting house prices using multiple linear regression (MLR).

What is the nature of our data set?

Target: predict price from the numerical predictors

Let's use just a subset of all the given variables as predictors.

We have to exclude the predicted (target) variable from the training data set.

Let's define the formula for our model by introducing the chosen variable names.

And now fit the model using OLS.

It is always good to include VIF (Variance Inflation Factor) information for the variables, for a better overview.

1.3 Regularization of the model coefficients: Ridge (L2) Regularization

Now we are going to use scikit-learn to perform Regularized Multiple Linear Regression; we'll be using the Ridge Regression.

As we have seen, the Ridge Regression model is obtained by minimizing the function

$$SSE + \alpha L_2^2,$$

where $SSE$ is the Sum of Squared Errors of the 'ordinary' Multiple Linear Regression

$$\hat{y} = \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k + \beta_0,$$

and $L_2^2$ is the squared $L_2$ norm of the Multiple Linear Regression model parameters $\beta_1,\beta_2,\ldots,\beta_k$:

$$L_2^2 = \beta_1^2 + \beta_2^2 + \cdots + \beta_k^2,$$

and hence the name L2 Regularization. $\alpha$ is called the penalization (hyper)parameter, and it is used to control the magnitude of the model's parameters; for $\alpha = 0$ we recover ordinary MLR.

Split the data into two parts: the predictor data set and the predicted (target) variable.

Now we fit sklearn's Ridge model for Ridge regression.
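A hedged sketch of such a fit; `make_regression` stands in for the actual King County training data, and scaling is included because the L2 penalty is sensitive to predictor scale:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in the notebook X and y come from the King County set.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)

# Scale the predictors first: the L2 penalty is scale-sensitive.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(np.round(model.named_steps["ridge"].coef_, 2))
print("R2:", round(model.score(X, y), 4))
```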

How about we change the alpha value and see what happens to the model parameters?

And how does scaling alpha affect the $L_2$ norm, MSE, and R2 scores?
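One way to tabulate these quantities over a range of alpha values (synthetic stand-in data again; the alpha grid is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)

rows = []
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    rows.append({
        "alpha": alpha,
        "L2_norm": np.linalg.norm(ridge.coef_),  # coefficient magnitude
        "MSE": mean_squared_error(y, ridge.predict(X)),
        "R2": ridge.score(X, y),
    })

results = pd.DataFrame(rows)
print(results)
```

The $L_2$ norm of the coefficients decreases monotonically as alpha grows, while the in-sample fit (MSE, R2) degrades.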

We see that the best model is obtained with a small penalization hyperparameter, i.e. with large model parameter (coefficient) values.

1.4 LASSO (L1) Regularization

LASSO is another way to regularize an MLR model. It is obtained by minimizing the function

$$SSE + \alpha L_1,$$

where $SSE$ is the Sum of Squared Errors of the ordinary Multiple Linear Regression and $L_1$ is the $L_1$ norm of its coefficients, given by

$$L_1 = |\beta_1| + |\beta_2| + \cdots + |\beta_k|,$$

and hence the name L1 Regularization. $\alpha$ is, of course, the penalization hyperparameter.

Here we use sklearn's Lasso model for the Lasso-regularized model.
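A sketch of a Lasso fit over several alpha values, on synthetic stand-in data, counting how many coefficients survive the penalty:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Ten predictors, only four of which are informative
X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=10.0, random_state=42)

n_nonzero = {}
for alpha in [0.1, 1, 10, 100]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_nonzero[alpha] = int(np.sum(lasso.coef_ != 0))  # surviving coefficients
    print(f"alpha={alpha}: {n_nonzero[alpha]} non-zero coefficients, "
          f"R2={lasso.score(X, y):.3f}")
```

Larger alpha values eliminate more coefficients, which is the feature-selection behavior discussed above.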

And how does scaling alpha affect the $L_2$ norm, MSE, and R2 scores?

We see that the best model is obtained with a small penalization hyperparameter, i.e. with large model parameter values.

2. Maximum Likelihood Estimator (MLE)

Maximum Likelihood Estimation (MLE) is yet another approach to estimating model parameters. It is a way of estimating the parameter(s) of the probability distribution that generated the sample we have at hand. Before showing examples of how parameter values are estimated using MLE, let's refresh some of the things we talked about in Session 11.

Back then, we introduced something that is called Bayes' Theorem. Given two events $A$ and $B$, and given that $P(A)\neq0$, we can write:

$$ P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$

Also, we said that $P(A|B)$ is also called the likelihood, and introduced another notation for it, namely $\mathcal{L}(B|A)$. If you remember, we said that:

$$\mathcal{L}(B|A) = P(A|B).$$

So, the likelihood of the event $B$ given that the event $A$ has occurred is equal to the conditional probability of event $A$ occurring under the condition that event $B$ has occurred.

If we speak in terms of random variables and distribution parameters, we usually write:

$$P(\theta; X) = \frac{\mathcal{L}(\theta; X)P(\theta)}{P(X)}$$

and

$$\mathcal{L}(\theta; X) = P(X; \theta).$$

where $X$ is an RV and $\theta$ are its distribution's parameters (when there is a single parameter, $\theta$ is a scalar; when there are multiple parameters, $\theta$ is a vector).

Now, imagine we have a sample of 3 observations $x_1, x_2, x_3$ of an RV $X$. In this case, the previously rewritten Bayes' Theorem would be:

$$P(\theta; x_1, x_2, x_3) = \frac{\mathcal{L}(\theta; x_1, x_2, x_3)P(\theta)}{P(x_1, x_2, x_3)}.$$

If we focus on just the $\mathcal{L}(\theta; x_1, x_2, x_3)$ term, use the already defined equality $\mathcal{L}(\theta; X) = P(X; \theta)$, and note that $x_1, x_2$ and $x_3$ are statistically independent, we have:

$$\mathcal{L}(\theta; x_1, x_2, x_3) = P(x_1, x_2, x_3; \theta) = P(x_1; \theta)\cdot P(x_2; \theta)\cdot P(x_3; \theta). $$

In general, if we have a sample of $N$ observations, we have:

$$\mathcal{L}(\theta; x_1, x_2,..., x_N) = \prod_{i=1}^{N}{P(x_i; \theta)}. $$

This multiplication can easily become numerically intractable in practice: the product of many small probabilities quickly underflows. This is the reason why we define the log-likelihood as:

$$\ell(\theta; x_1, x_2, \ldots, x_N) = \log\Big(\prod_{i=1}^{N}{P(x_i; \theta)}\Big) = \sum_{i=1}^{N}{\log P(x_i; \theta)}.$$

This works because the logarithm is a strictly increasing function, so by maximizing the log-likelihood we are in essence maximizing the likelihood.
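A small worked illustration of the grid-based idea, assuming a hypothetical Bernoulli sample of 4 heads and 6 tails:

```python
import numpy as np

# Hypothetical sample of 10 tosses: 1 = Heads, 0 = Tails (4 heads, 6 tails)
tosses = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

def log_likelihood(p, data):
    """Sum of Bernoulli log-probabilities: sum_i log P(x_i; p)."""
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Evaluate the log-likelihood over a grid of candidate values for P(H)
grid = np.linspace(0.01, 0.99, 99)
ll = np.array([log_likelihood(p, tosses) for p in grid])
p_hat = grid[np.argmax(ll)]
print(p_hat)  # maximized at the sample proportion of heads
```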

And now, the examples.

Tossing a coin

Imagine I have a fair coin: the one that lands Heads and Tails with equal probabilities P(H) = P(T) = 1/2.

Uhm, but I have already assumed that we would be tossing a fair coin? Four heads, six tails: how come? Is this really a fair coin? The answer is: it still might be. Let's see.

What about now?

Let's create 1000 samples and check the proportions of heads and tails.

Let's plot the distribution of these probabilities.

Ok, now let's see what happens with 10,000 statistical experiments with a fair coin:

What about 100,000 experiments?

Theoretical probability: when we know that some stochastic system produces an outcome with some particular probability.

Experimental probability: when we estimate that some stochastic system produces an outcome with some particular probability from observed data.

Reverse engineering the coin

In all my previous statistical experiments, I have assumed that the coin is fair, i.e. P(H) = P(T) = .5; look:

tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])

I have used numerical simulations of coin tosses to demonstrate that, with an increasing number of experiments, I estimate the known probabilities P(H) and P(T) better and better.

However, we typically do not know the theoretical probability of a random event... Example: the CTR of some social media post.

CTR (Click-Through Rate) is like a coin toss: upon each served impression, a post is either clicked or not clicked. We can thus treat the CTR like a coin toss: P(H) = P(Clicked) is "post clicked", P(T) = P(Not clicked) is "post not clicked". Then we can say that P(Clicked) is some latent, essential characteristic of the post.

Imagine we have a set of observations for a particular social media post: 1,1,1,0,0,0,1,0,1,0,1,0,0,0,1,0.., where 1 = Clicked and 0 = Not clicked. How do we estimate the CTR of the post from these data?

Easy, right: sum up the vector of observations and divide by its length. But wait. There's more to it.

I have just invented a social media post with a CTR of .3, and estimated its experimental probability to be .3035; that is the best estimate I currently have. What if I want to test a set of hypotheses about the CTR of this post? What, for example, if I am interested in the probability of this post having a CTR of .25 instead?

Let's use a simpler, smaller set of observations to begin with.

Imagine we toss a fair coin with P(H)=.5 twice and observe two Heads. The probability of observing two heads with P(H)=.5 is:

.5 * .5 = .25

But what if the coin was not fair, and instead had a probability of P(H)=.3? Then:

.3 * .3 = 0.09

And what if the coin had P(H) = .7?

.7 * .7 = 0.49

With P(H) = .9 we get:

.9 * .9 = 0.81

and it seems intuitive and logical: if we have observed a sample of all heads, however small it was (we are assuming we observed two Heads from two tosses here), it makes sense that we are dealing with a coin that almost always lands Heads.

Let's express this intuition visually. First, we need a function to compute the probability that a coin with a certain P(H) produced the data.

Let's define a helper function for coin tossing. It will give us the probability of an observation, based on the probability of getting Heads. If we say $P(H)=.2$, then each time we get Heads the function returns 0.2.

Remember that we defined the likelihood as the product of each observation's probability? Here we define a function to do just that.
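A sketch of these two helpers, with hypothetical names (`toss_probability`, `likelihood`) that need not match the notebook's own:

```python
import numpy as np

def toss_probability(observation, p_heads):
    """Probability of one observation: p_heads for 'H', 1 - p_heads for 'T'."""
    return p_heads if observation == 'H' else 1 - p_heads

def likelihood(observations, p_heads):
    """Likelihood of the sample: the product of the observation probabilities."""
    return np.prod([toss_probability(o, p_heads) for o in observations])

print(likelihood(['H', 'H'], 0.5))  # .5 * .5 = 0.25
print(likelihood(['H', 'H'], 0.9))  # .9 * .9, approximately 0.81
```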

Nice. Now, let's define our parameter's domain. Remember, we are trying to find the maximum-likelihood value of the probability of getting Heads ($P(H)$). Since this is a probability, the domain is the interval $[0, 1]$.

Now for Log-Likelihood:

Let's ask explicitly:

And what if P(H) was .85?

Pick any arbitrary observation

Good. What have we learned?

To estimate the parameter of a simple statistical model from data (a coin, in our case: a simple stochastic system with only two outcome states), we need to find the maximum of its likelihood function given the observed data.

For our example with the HH minimal observations - tossing a coin two times and observing two Heads in a row - the likelihood function is:

Of course: however small the sample size, the sample represents everything that we know about the coin; if we observe all heads, as in HH, then the most probable coin that could have produced the result is the one that yields Heads all the time!

2.1 MLE in Linear Regression

Let's use the Iris data set for another example.

Let's plot petal length as a function of sepal length.

How about we use optimization to get the model's parameters?
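One way to do this, sketched with `scipy.optimize.minimize` on the sklearn copy of the Iris data: we maximize a Gaussian log-likelihood over the intercept, the slope, and the (log) noise scale. The notebook's own optimization setup may differ:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 0]  # sepal length (cm)
y = iris.data[:, 2]  # petal length (cm)

def neg_log_likelihood(params):
    """Negative Gaussian log-likelihood of the model y = b0 + b1 * x + noise."""
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)  # optimize log(sigma) to keep sigma positive
    resid = y - (b0 + b1 * x)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - resid ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0, 0.0])
b0_hat, b1_hat, log_sigma_hat = result.x
print(b0_hat, b1_hat, np.exp(log_sigma_hat))
```

Because the noise is modeled as Gaussian, the maximum-likelihood intercept and slope coincide with the ordinary least-squares estimates for this regression.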

Further Reading


DataKolektiv, 2022/23.

hello@datakolektiv.com

License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.